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Abstract 

Context. The flare-productivity of an active region is observed to be related to its spatial complex¬ 
ity. Mount Wilson or McIntosh sunspot classifications measure such complexity but in a categorical 
way, and may therefore not use all the information present in the observations. Moreover, such cat¬ 
egorical schemes hinder a systematic study of an active region’s evolution for example. 

Aims. We propose fine-scale quantitative descriptors for an active region’s complexity and relate 
them to the Mount Wilson classification. We analyze the local correlation structure within contin¬ 
uum and magnetogram data, as well as the cross-correlation between continuum and magnetogram 
data. 

Methods. We compute the intrinsic dimension, partial correlation and canonical correlation anal¬ 
ysis (CCA) of image patches of continuum and magnetogram active region images taken from 
the SOHO-MDI instrument. We use masks of sunspots derived from continuum as well as larger 
masks of magnetic active regions derived from magnetogram to analyze separately the core part of 
an active region from its surrounding part. 

Results. We find relationships between the complexity of an active region as measured by its 
Mount Wilson classification and the intrinsic dimension of its image patches. Partial correlation 
patterns exhibit approximately a third-order Markov structure. CCA reveals different patterns of 
correlation between continuum and magnetogram within the sunspots and in the region surround¬ 
ing the sunspots. 

Conclusions. Intrinsic dimension has the potential to distinguish simple from complex active re¬ 
gions. These results also pave the way for patch-based dictionary learning with a view towards 
automatic clustering of active regions. 

Key words. Sun - active region - sunspot - data analysis - classification - image patches - 
intrinsic dimension - partial correlation - CCA 
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1. Introduction 

Active regions (AR) in the solar atmosphere have intense and intrieate magnetie fields that emerge 
from subsurfaee layers to form loops whieh extend into the eorona. When aetive regions undergo 
external foreing sueh as flux emergenee and rearrangement, the system may destabilize. The stored 
magnetie energy is then suddenly released as aeeelerated partieles (eleetrons, protons, ions) and an 
inerease in radiation ealled a flare is observed aeross the entire eleetromagnetie speetrum (Phillips, 
1991). 

The morphology of sunspots is eorrelated with flare oeeurrenee and has therefore reeeived a lot 
of attention. The Mount Wilson elassifieation seheme (Hale et ah, 1919) groups sunspots into four 
main elasses based on the magnetie strueture, that is, on the relative loeations and sizes of eoneen- 
trations of opposite polarity magnetie flux. The sunspots with simplest morphology belong to the 
unipolar a and the bipolar groups. More eomplex morphologies are deseribed as fly when a bipo¬ 
lar sunspot is sueh that a single north-south polarity inversion line eannot divide the two polarities. 
When a fly sunspot group eontains in addition a 6 spot, that is, umbrae of different polarities inside a 
single penumbra, it is labeled as a fly6 group. The presenee of a d eonfiguration, where large values 
of opposite polarity exist elose together, was identified as a warning of the build up of magnetie 
energy stress with an inereased probability of a large flare (Mayfield and Lawrenee, 1985; Sammis 
et ah, 2000). Meintosh (1990) proposes another elassifieation seheme eontaining 60 elasses, thus 
deseribing the magnetie strueture in greater details. The Meintosh elassifieation is the basis for sev¬ 
eral flare foreeasting methods whieh estimate the flare oeeurrenee rate from historieal reeords of 
flares and aetive region elasses (Bornmann and Shaw, 1994), possibly eombining sueh information 
with observed waiting time distribution between flares (Gallagher et ah, 2002; Bloomfield et ah, 
2012 ). 

The Meintosh and Mount Wilson elassifieations are in general earned out manually, and this 
results in ineonsisteneies that stem from human observation bias as well as non-reprodueible eata- 
logs. To overeome these eaveats, some supervised maehine learning methods have been proposed 
to automatieally elassify sunspot groups aeeording to these sehemes. Stenning et al. (2013) ex- 
traet various measurements from eontinuum and magnetogram images, and then feed these into a 
maehine learning elassifier whieh reproduees the Mount Wilson elassifieation. Colak and Qahwaji 
(2008) employ neural networks and supervised elassifieation teehniques to reproduee the Meintosh 
seheme and use those results in a flare foreeasting system (Colak and Qahwaji, 2009). While these 
approaehes reduee the human bias, they do not use the information present in sunspot images in an 
optimal way and make the study of AR dynamie behavior impraetieal. 

Several attempts were made to find quantitative deseriptors of an aetive region’s eomplexity. 
MeAteer et al. (2005) showed that fraetal dimension of an aetive region alone eannot distinguish 
between the various Mount Wilson elasses. The generalization to multifraetal speetrum, where eaeh 
seale has its own fraetal dimension, allowed to study in greater details the evolution of aetive re¬ 
gion in view of distinguishing between quiet and flare-produetive aetive regions. Box eounting 
methods (Georgoulis, 2005; Abramenko, 2005; Conlon et al., 2008) as well as more aeeurate meth¬ 
ods based on eontinuous wavelet transform (Kestener et al., 2010; Conlon et al., 2010) were em¬ 
ployed. Continuous wavelet transforms and energy speetrum were also used with a similar purpose 
in Hewett et al. (2008); MeAteer et al. (2010). 
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Figure 1. An example pateh from the edge of a sunspot in a eontinuum image and its eolumn 
representation. 


Wavelet basis functions act as a microscope to describe local discontinuities and gradients in an 
image, and Ireland et al. (2008) used two multiresolution analyses to compute at various length 
scales the gradients of the magnetic field along lines separating opposite polarities. Using a data 
set of about 10 000 magnetogram images, they showed that, at all length scales, those gradients 
increase going from a to jS, /3y, and /3yd classes. 

However, a wavelet analysis is known to generate artifacts due to the particular shape of the 
specific wavelet functions. Signal representations based on a set of redundant functions called a 
dictionary, were therefore introduced (Mallat and Zhang, 1993). Elad and Aharon (2006) proposed 
the use of a small sized dictionary to find a sparse representation of patches. Specifically, a patch is 
a m X m-pixel neighborhood, and a patch analysis of a n-pixel image will process the x n data 
matrix that collects the overlapping patches. See Figure 1 for a representation. 

As an example of image patch analysis, Elad and Aharon (2006) considered the problem of 
denoising an image corrupted by additive Gaussian noise. They computed a sparse representation 
of patches over a dictionary, thus effectively denoising the patches. The dictionary itself may either 
be fixed a priori or learned from the corrupted patches. An estimate of the noise-free image is 
then obtained by averaging the denoised overlapping patches. Elad and Aharon (2006) showed that 
dictionary learning methods based on patch analysis are more flexible and provide superior results 
in the context of image denoising. 

In this paper, we carry out a patch analysis of a set of sunspots and active region magnetogram 
images that span the four main Mount Wilson classes. We estimate the intrinsic dimension of the 
local patches, and show how it relates to the Mount Wilson classification. We also study patterns 
of local correlation using partial correlation and canonical correlation analysis, which reveal some 
characteristics of simple and more complex active regions. Such analysis also serves as a prepara¬ 
tion to an unsupervised clustering of active region using patch-based dictionary learning which will 
be presented in a companion paper (Moon et al., 2015). 

Section 2 describes our data set. Unlike previous works, our approach combines information 
from two modalities: photospheric continuum images and magnetograms, both obtained by the 
Michelson Doppler Imager (MDI) on board the Solar and Heliospheric Observatory (SOHO) 
spacecraft. We consider 424 active regions spanning the four main Mount Wilson classes. We use 
SMART masks (Higgins et al., 2011) to delineate the boundaries of magnetic active regions, and 
the STARA algorithm (Watson et ah, 201 1) which provides masks for umbrae and penumbrae from 
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Table 1. Number of each AR per Mt. Wilson class. Simple ARs include a and yS groups while 
complex ARs are Py and ySyd groups. 



a 


^y 

fiyd 

Simple 

Complex 

Total 

Number of AR 

50 

192 

130 

52 

242 

182 

424 


the continuum images. These two masks enable us to differentiate between pixels belonging to the 
actual sunspots and pixels featuring the region surrounding the sunspots. 

In Section 3, the intrinsic dimension of the image patches extracted from the two modalities is es¬ 
timated using both linear and non-linear methods. The linear method relies on Principal Component 
Analysis (PCA) (Jolliffe, 2002), while the non-linear method relies on a k-Nearest Neighbor graph 
approach (Costa and Hero III, 2006; Carter et ah, 2010). The latter method also estimates the local 
intrinsic dimension, which has several advantages over a global estimate. We show that the intrinsic 
dimension is related to the complexity of the sunspot groups. 

Section 4 identifies the spatial and modal interactions of the patches at different scales by estimat¬ 
ing the partial correlation and by using canonical correlation analysis (CCA) (Muller, 1982; Nimon 
et ah, 2010). This gives insight about relationships that may exist between active region complexity 
and the correlation patterns. 

This paper expands and refines some of the work in Moon et al. (2014). Whereas Moon et al. 
(2014) used fixed size square pixel regions centered on the sunspot group as input to the analyses, 
in this paper SMART detection masks are used. A larger set of images is considered in all methods 
which enables us to analyze the relationships of intrinsic dimension and correlation with AR com¬ 
plexity. We also explore the partial correlation of patches which was not included in Moon et al. 
(2014). 

2. Data 

The data used in this study are taken from the Michelson Doppler Imager (MDI) instru¬ 
ment (Scherrer et ah, 1995) on board the SOHO Spacecraft. 

Within the time range of 1996-2010, we select a set of 424 ARs as follows. Using the infor¬ 
mation from the Solar Region Summary reports compiled by the Space Weather Prediction Center 
of NOAA http://www.swpc.noaa.gov/ftpdir/forecasts/SRS/, we consider ARs located 
within 30° of the solar meridian. We looked at a maximum of two-hundred instances per Mount 
Wilson types a,/3,/3y, and fly6. Out of this first selection, we removed AR with a longitudinal extent 
smaller than four degrees, and finally we checked if MDI continuum and magnetogram data were 
available. This provides us with a number of ARs in each Mount Wilson class as given by Table 1. 
In our analysis, we also divide the ARs into two groups: simple ARs {a and fh) and complex ARs 
(fiy and jSyd). 

AR are observed using two modalities: photospheric continuum images and magnetogram. 
SOHO-MDI provides almost continuous observations of the Sun in the white-light continuum, in 
the vicinity of the Ni 1 676.78 nm photospheric absorption line. These photospheric intensity images 
are primarily used for sunspot observations. MDI data are available in several processed “levels”. 
We use level-1.8 images, and rotate them with North up. SOHO provides two to four MDI pho¬ 
tospheric intensity images per day with continuous coverage since 1995. We also use the level-1.8 
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line-of-sight (LOS) MDI magnetograms, recorded with a nominal cadence of 96 minutes. The mag¬ 
netograms show the magnetic fields of the solar photosphere, with negative (represented as black) 
and positive (as white) areas indicating opposite LOS magnetic-field orientations. 

As stated in Section 1, SMART masks (Higgins et ah, 201 1) are used to determine the boundaries 
of magnetic active regions from MDI magnetograms. Those masks are applied also on continuum 
images to determine the surrounding part of the sunspot that is affected by magnetic fragments 
as seen in magnetogram images. Similarly, the STARA algorithm (Watson et ah, 2011) provides 
masks for sunspots (umbrae and penumbrae) from MDI continuum and those masks are applied on 
magnetogram images to determine the AR cores corresponding to the sunspots. Combining these 
two types of masks provides thus two sets of pixels within each AR: those belonging to the sunspots 
themselves as found by STARA and those belonging to the magnetic fragments (or background) 
within an AR as found by the difference set between the SMART and STARA masks. 

As in Moon et al. (2014) we use image patch features to account for spatial dependencies using 
square patches of pixels. Thus if a SMART mask of an image has n pixels and we use a mxm patch, 
the corresponding continuum data matrix X is x n where the fth column contains the pixels in 
the patch centered at the fth pixel. The magnetogram data matrix Y is formed in the same way and 

the full data matrix is Z = | ^ | with size 2m^ x n. We let z, denote the fth column of Z. The images 


from both modalities are also normalized prior to analyzing them. 

In image patch analysis, the size of the patch should be no larger than the smallest feature that is 
to be captured. Otherwise, the relevant feature may be suppressed. Additionally, large patches lead 
to high-dimensional estimates which suffer in accuracy from "the curse of dimensionality," which 
refers to the fact that the number of observations must increase at least linearly in the number of 
parameters for accurate estimates to be possible in statistical inference (Biihlmann and Van De Geer, 
2011). Since some sunspot and active region features can be quite small and to limit the effects of 
high dimensionality on our analysis, we primarily use 3x3 patches in each modality although larger 
patches are used in Section 4 when analyzing spatial correlations in the images. 


3. Intrinsic Dimension Estimation 

The goal of this section is to determine the number of intrinsic parameters or degrees of freedom 
required to describe the spatial and modal dependencies using image patches. We consider 3x3 
patches within both the continuum and magnetogram images giving an extrinsic dimension of 18. 
The intrinsic dimension will determine how redundant these 18 dimensions are. In addition, intrin¬ 
sic dimension provides an indicator of complexity which we compare against the Mount Wilson 
classification, similarly to what McAteer et al. (2005) and Ireland et al. (2008) did using fractal 
dimension and gradient strength along polarity separating lines, respectively. More details on the 
concept of intrinsic dimension on manifolds are included in Appendix A.l. 

It is also important to know whether linear analyses can be accurately applied to the data or 
whether non-linear techniques are required. Linear methods have been applied successfully to solar 
images before such as in Dudok de Wit et al. (2013). However, it is not guaranteed that natural 
images are best represented using linear methods as there are cases where non-linear models have 
superior performance (Dobigeon et al., 2014). Thus this is important to investigate both for further 
analysis of the data as in Moon et al. (2015) and for the correlation analysis in Section 4. If the data 
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lie on a nonlinear subspaee and we perform a linear analysis of the data (e.g. partial eorrelation, 
eanonieal eorrelation analysis, or prineipal eomponent analysis), then the results will be only a 
linear approximation of the true relationships and dependeneies of the data. Nonlinear methods 
of analysis would be neeessary to obtain higher aeeuraey in this ease. To answer this question, we 
estimate the loeal intrinsie dimension using a method appropriate for linear subspaees and a method 
appropriate for any (linear or non-linear) smooth subspace and then compare the results. 

3.1. PCA: A Linear Estimator 

Principal Component Analysis (PCA) (Jolliffe, 2002) finds a set of linearly uncorrelated vectors 
(principal components) that can be used to represent the data. PCA has been used previously for 
various purposes in solar-physics and space-weather literature, e.g. to study the background and 
sunspot magnetic fields (Lawrence et ah, 2004; Cadavid et ah, 2008; Zharkova et ah, 2012), for 
analysis of solar wind data (Holappa et ah, 2014), or to reduce dimensionality (Dudok DeWit and 
Auchere, 2007). 

In PCA, the principal components are the eigenvectors of the covariance matrix E: 

y - ( \ 

UyxSyy/’ 

where x and y are random vectors of dimension 9, x being a patch from the continuum image, and 
y the corresponding patch from the magnetogram image. The eigenvalues indicate the amount of 
variance accounted for by the corresponding principal component. A linear estimate of intrinsic 
dimension is the number of principal components that are required to explain a certain percentage 
of the variance. 

By nature, PCA is a global operation and so it provides a global estimate of the intrinsic dimen¬ 
sion. We can obtain more local estimates by performing PCA separately on the areas within the 
sunspots and on the magnetic fragments. These areas are separated using the STARA and SMART 
masks. 

3.2. k-NN: A General Estimator 

The general method we use is a k-Nearest-Neighbor (k-NN) graph approach with neighborhood 
smoothing (Costa and Hero III, 2006; Carter et al., 2010). The intuition behind the method is that 
we grow the k-NN graph from a point z, by adding an edge from z, to z, if Zj is within the k nearest 
neighbors of z,. The growth rate of the total edge length of the graph is related to the intrinsic 
dimension of the data in a way that enables us to estimate it. 

One advantage of the k-NN method, in contrast to global methods such as Levina and Bickel 
(2004), is it provides an estimate of the local intrinsic dimension by limiting the growth of the 
graph to a smaller neighborhood. This provides an estimate of intrinsic dimension at each pixel 
location in the image which allows us to more easily visualize the intrinsic dimension estimates. 
Additionally, when the number of samples within a region of interest is small (such as within a small 
sunspot), this local method provides more accurate estimates of intrinsic dimension than applying 
a global method (such as PCA) since the inclusion of the neighboring pixels results in a higher 
number of samples. Technical explanation of the k-NN method and more details on local intrinsic 
dimension are given in Appendix A.l. 
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Figure!. Examples of the estimated loeal intrinsie dimension using the fc-NN method for an a 
group (top) and a pyd group (bottom). Regions with more spatial structure have lower intrinsic 
dimension. 


3.3. General Results 

We estimate the intrinsic dimension of the image patches within the sunspots and magnetic frag¬ 
ments for all 424 ARs using both the fc-NN approach and PCA, where the extrinsic dimension of the 
joint patches is 18. Figure 2 shows two examples of the estimated local intrinsic dimension using 
the fc-NN method and the corresponding continuum and magnetogram images. One set of images 
corresponds to an a group while the other set is a /3yd group. In these examples, areas with more 
spatial structure, such as within the sunspots, have lower intrinsic dimension. Fewer parameters are 
required to accurately represent structured data than noise and so the intrinsic dimension is lower. 

Table 2 provides the mean and standard deviation of the intrinsic dimension estimates within 
the sunspots and magnetic fragments. These statistics are also provided for ARs within the main 
Mount Wilson classes {a, (3, /3y, and /3yd). We provide PCA results for the cases where we estimate 
the intrinsic dimension as the number of components required to explain 97% and 98% of the 
variance, respectively. For the fc-NN method, we provide the results in two ways. For one, we 
take the mean of local intrinsic dimensions within each image (separating the ‘sunspot’ from the 
‘magnetic fragments’) and then calculate the mean and standard deviation of these means. The 
statistics in this category correspond to the mean and standard deviation of the average intrinsic 
dimension of each image and are more directly comparable to the PCA results. However these 
results may be affected slightly by small sunspot groups. For the other approach, we pool all of the 
local estimates (again separating sunspots from magnetic fragments) and then calculate the mean 
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Table 2. Estimated intrinsic dimension results for different groups of ARs in the form of 
mean+standard deviation. The complex ARs have higher intrinsic dimension within the sunspots 
than the simple ARs but lower intrinsic dimension within the magnetic fragments. 



a 

yS 

Py 

I3y6 

All 

Sunspots A:-NN, pooled 

3.9 ± 1.2 

4.4+ 1.0 

4.4 + 0.9 

4.5 + 0.7 

4.3 + 1.0 

Sunspots ^-NN, means 

4.0 ± 1.0 

4.8 + 1.2 

4.4 + 0.6 

4.4 + 0.4 

4.5 + 1.0 

Sunspots PCA 97% 

3.7+ 0.8 

4.4 + 0.9 

4.3 + 0.6 

4.2 + 0.5 

4.3+ 0.8 

Sunspots PCA 98% 

4.5 + 0.9 

5.2+ 1.0 

5.1 +0.8 

5.0 + 0.5 

5.0+ 0.9 

Fragments ^-NN, pooled 

8.0+ 1.1 

8.0+ 1.2 

7.7 + 1.2 

7.6+ 1.2 

8.0+ 1.2 

Fragments ^-NN, means 

8.0+ 0.4 

7.9 + 0.5 

7.6+ 0.5 

7.6+ 0.5 

7.8+ 0.5 

Fragments PCA 97% 

7.7 + 1.6 

7.1 + 1.2 

6.2+ 1.2 

6.2+ 1.3 

6.8 + 1.4 

Fragments PCA 98% 

9.1 + 1.6 

8.5 + 1.3 

7.5 + 1.2 

7.4+ 1.3 

8.1 + 1.4 


and standard deviation. These results correspond to the mean and standard deviation of the pixels 
within each region and category and are less affected by small sunspot groups. 

From Table 2, it is clear that the intrinsic dimension is lower within the sunspots than in the 
magnetic fragments for all methods. This is expected as there is more spatial structure within the 
images inside the sunspots than in the magnetic fragments, especially in the continuum image. 

The average PCA estimate with a 97% threshold and the average mean fc-NN estimate give similar 
results inside the sunspot while the average 98% PCA estimate is closest to the average mean fc-NN 
estimate within the magnetic fragments. If linear methods were not sufficient to represent the spatial 
and modal dependencies, we would expect the PCA results to be much higher than the fc-NN results 
when using comparable thresholds as more linear than nonlinear components would be required to 
accurately represent the data. However, this close agreement between the general and linear results 
suggests that linear methods are sufficient and that linear dictionary methods would be appropriate 
for these data. 

3.4. Patterns Within the Mount Wilson Groups 

For both the PCA and fc-NN methods, the average estimated intrinsic dimension is lower within the 
sunspots in a groups than in the more complex groups such as f3y6. This is consistent with Figure 2 
and may be related to the lower complexity of a groups. These exhibit more spatially coherent 
images, which can be described using a lower number of basis elements, and hence have a lower 
intrinsic dimension. 

Within the magnetic fragments, the opposite trend occurs where the less complex groups have 
higher intrinsic dimension. This suggests that the magnetic fragments are fewer, weaker, and less 
structured outside of the a and p groups compared to the more complex regions, leading to a more 
noise-like background in their magnetic fragments. This hypothesis is supported by the normalized 
histograms of the mean k-NN estimates of intrinsic dimension and the normalized histograms of 
the pooled k-NN estimates in Figures 3 and 4. The histograms of mean intrinsic dimension show 
that within the magnetic fragments, a groups generally have higher mean intrinsic dimension than 
I3y5 groups. In fact, no a groups have a mean intrinsic dimension less than 7.5 within the magnetic 
fragments. However, the normalized histograms of the individual patch estimates show a significant 
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Figure 3. Normalized histograms of mean estimated intrinsic dimension of a, /3, ySy, and /Syd groups 
using the fc-NN method. The distributions of intrinsic dimension differ by complexity with simpler 
AR groups having higher (resp. lower) intrinsic dimension within the sunspot (resp. magnetic frag¬ 
ments). 


number of patches with intrinsic dimension less than 7.5 within the fragments. This suggests that for 
each a group, the majority of the patches have higher intrinsic dimension in the magnetic fragments 
and are thus more noise-like. In contrast, there are some fiyS groups where the mean intrinsic 
dimension of the magnetic fragments is lower (less than 7.5) and so these magnetic fragments are 
dominated by patches with more structure. 

Table 2 also shows that the standard deviation of the estimates within the sunspots decreases as 
the complexity increases as measured by the Mount Wilson classification scheme. The histograms 
in Figures 3 and 4 can be used to determine the cause. From the histograms, it is clear that within 
the sunspots the intrinsic dimension of a groups does not have a Gaussian distribution. In this case, 
most of the estimates are between 3 and 5. However, there are a significant number of outliers with 
intrinsic dimension greater than 5. The presence of these outliers contributes to the high standard 
deviation. This is in contrast to the intrinsic dimension of /3y and /3y6 groups inside the sunspot 
which have fewer outliers and thus smaller standard deviations. 
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Figure 4. Normalized histograms of pooled local estimates of intrinsic dimension of a, /3, /3y, and 
py5 groups using the A:-NN method. The distributions of intrinsic dimension differ by complexity 
with simpler AR groups having higher (resp. lower) intrinsic dimension within the sunspot (resp. 
magnetic fragments). 


The outliers in the a groups correspond to small sunspots. The number of pixels within the a 
sunspots with average intrinsic dimension > 6 range between 10 and 53 with a median of 16. In 
these cases, the spatial structure of the sunspots may be more similar to the magnetic fragments than 
the spatial structure of larger sunspots. Thus the intrinsic dimension is higher in the small sunspots. 

A similar phenomenon occurs within the jS groups. Note that in Table 2, the average and standard 
deviation of the mean intrinsic dimension of the p groups within the sunspots is higher than for all 
other groups. This is also caused by a few outliers that have high average intrinsic dimension due 
to the small size of the sunspots. When individual local intrinsic dimension estimates of the patches 
from these small sunspots are pooled with the estimates from all other /3 patches, the average in¬ 
trinsic dimension is more aligned with that of the other Mount Wilson types. Additionally, ignoring 
the biggest outliers in the mean intrinsic dimension (defined as having mean intrinsic dimension 
> 6.25) gives an average mean intrinsic dimension of 4.6 for the jS groups which is more aligned 
with the other groups. 
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The distribution of intrinsic dimension within the magnetic fragments also differs by complexity 
based on Figures 3 and 4. The complex ARs have more patches and images with lower intrinsic 
dimension than the simple sunspots which is consistent with Table 2. 

In summary, based on the estimated intrinsic dimension of the image patches, relatively few pa¬ 
rameters are required to accurately represent the data. We have found that the distribution of local 
intrinsic dimension varies based on the complexity of the sunspot group with the more complex 
sunspots having higher (resp. lower) intrinsic dimension within the sunspot (resp. magnetic frag¬ 
ments). Additionally, the standard deviation of the intrinsic dimension is higher within the sunspot 
in the simpler sunspots than the complex ARs. This is due to the presence of small sunspots among 
the simpler ARs that tend to have less spatial structure and thus a higher intrinsic dimension than 
typical sunspots. We have also shown that linear methods should be sufficient to accurately analyze 
the data. 

4. Spatial and Modal Correlations 

The results in the previous section indicate that linear methods are likely sufficient to represent the 
spatial and modal dependencies within a sunspot. We therefore analyze the linear correlation over 
patches using partial correlation and canonical correlation analysis (CCA). 

The partial correlation is proportional to the inverse of the correlation matrix and analyzes the 
pixel-to-pixel correlation when the influence of all other pixels has been removed. It provides insight 
into how large a patch should be used to sufficiently capture the spatial and modal correlations in 
future analysis. 

CCA on the other hand is determined by finding the most correlated linear combinations of pixels 
from each image, solved as a generalized eigenvalue problem, which is useful for determining 
the degree of mutual correlation between two modalities. If the two modalities are independent, 
there is no benefit in processing them together, while if the two modalities are strongly dependent, 
processing only one of the modalities is sufficient since the other modality would not contain any 
additional information. 

4.1. Partial Correlation: Methodology 

The partial correlation measures the correlation between two random variables while conditioning 
on the remaining random variables. The intuition behind partial correlation can be best explained 
with the linear regression concept. Suppose you want to compute the partial correlation between two 
variables X\ and X 2 given a set of variables X. First, compute the linear regression using variables 
in X to explain X\ and obtain the associated residuals . Proceed similarly for X 2 and get residuals 
rx 2 - The partial correlation between X\ and X 2 is then equal to the (usual) correlation between 
and rxj, for which the effect of variables X have been removed. 

In our context, let x be a patch from the continuum image, and |y| be the magnitude (entry- 
wise absolute value) of the corresponding patch from the magnetogram. The partial correlation 

matrix P = ( 1 and its off-diagonal elements can be derived from the inverse correlation 

\ P|y|x P|yllyl / 

matrix (see Appendix A. 2). We use the magnitude of the magnetogram data since both positive and 
negative polarities affect the continuum image in similar ways. 


11 


Moon et al: Image patch analysis of active regions: Intrinsic dimension and correlation 


Sunspots Fragments 




Sunspots 


0.5 

0 



Fragments 


0.5 

0 


Figures. Estimated partial correlation matrices of patch data from within the sunspots and the 
magnetic fragments using 3x3 (left) and 5x5 (right) patches. The theoretical thresholds (Hero and 
Rajaratnam, 201 1) for significance to attain a 0.05 false alarm rate are 0.0070 and 0.0014 for within 
the sunspots and magnetic fragments, respectively when using a 3 x 3 patch. For the 5 x 5 patch, 
the thresholds are 0.0080 and 0.0016, respectively. Statistically insignificant values are set to zero. 


4.2. Partial Correlation: Results 

Figure 5 gives the estimated partial correlation matrices when using 3x3 and 5x5 patches. The 
patches are extracted from all of the active regions and divided using the STARA and SMART 
masks into sunspots and magnetic fragments as before. The partial correlation of 3 x 3 patches is 
quite strong within both modalities. Based on a false alarm rate of 0.05, the theoretical thresholds 
for significance for the partial correlation (Hero and Rajaratnam, 2011) of the 3x3 patches are 
approximately 0.0070 and 0.0014 for within the sunspots and magnetic fragments, respectively. 
For the 5x5 patches, the thresholds are 0.0080 and 0.0016, respectively. Given these thresholds, 
the partial correlation is statistically significant for nearly all values within the modalities (Pxx and 
P|yllyl) using the 3 X 3 patches. 

The cross-partial correlation when using the signed magnetic field (i.e. Pxy and Pyx) is very near 
zero in both regions (not shown). However, when we take the absolute value of the magnetogram 
patches, then the magnitude of the cross-partial correlation (Px|y| and P|y|x) is much higher in both 
regions suggesting that the correlation between the modalities is significant. The partial correlation 
is also stronger in magnitude in all cases within the sunspots than within the magnetic fragments. 

The partial correlation matrices are very structured. In both sunspots and magnetic fragments, the 
pentadiagonal-like structure within the modalities suggests that the image is generally stationary 
with approximately a third order nearest neighbor Markov structure in the pixels. Such structure is 
clearly seen in the matrices for 5 x 5 patches. The cross-partial correlation also has a pentadiagonal- 
like structure although the correlation is not as strong as within the modalities. 

To better see the spatial correlations, in Figure 6 we plot the partial correlation patches taken 
from the columns of the sunspot partial correlation matrix corresponding to the center pixels when 
using 5x5 patches. Figure 6 shows clearly the greater partial correlation within the continuum. 
It also highlights that correlation is slightly higher in magnitude in the vertical direction than the 
horizontal direction. Nearly all sunspots in this study are located within (-30°, -1-30°) from both the 
central meridian and the equator, and so projection effect are unlikely to cause this difference. The 
difference in correlation may be a feature of the sunspots themselves, but this may be difficult to 
determine since the difference in partial correlation is small. 
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Figure 6. Partial correlation patches extracted from the columns in the sunspot partial correlation 
matrix corresponding to the center pixels. The partial correlation is stronger within the continuum. 



Figure 7. Partial correlation matrices within the sunspots using the data from a (left) and /3 ARs 
(center). Statistically insignificant values are again set to zero. (Right) difference between the abso¬ 
lute value of the a and jS matrices. The a sunspots are more (resp. less) strongly correlated within 
(resp. between) the modalities than the /3 groups. 


Some slight differences exist in the partial correlation matrices restricted to certain Mount Wilson 
classes. As an example, Figure 7 contains the partial correlation matrices within the sunspots after 
restricting the data to a and (5 groups as well as the difference between the absolute value of the two 
matrices. The a partial correlation matrix is higher in magnitude within the modalities than the 
matrix but lower between the modalities. Within modalities, the strongest differences (a maximum 
of 0.056 and 0.067 within the continuum and magnetogram, respectively) are in the entries that 
correspond to pixels that are farther away from each other. In contrast, within the cross-partial 
correlation, the strongest differences (a maximum of 0.072) between the two AR types are in the 
entries that correspond to pixels that are close to each other. A similar pattern holds when comparing 
the a matrix to the matrices of the more complex groups. 

Overall, the partial correlation matrices indicate that no larger than a 5 x 5 patch is necessary to 
capture the local spatial dependencies. A 5 x 5 patch of pixels corresponds roughly to the size of a 
mesogranule (Rast, 2003; Rieutord et ah, 2000). This suggests that within the magnetic fragments, 
it is likely that the granules and mesogranules within the photosphere contribute to the local spatial 
dependencies. Within the sunspots, a 5 x 5 patch corresponds to the size of the characteristic length 
of the largest penumbral filaments (Tiwari et al., 2013) which suggests that on average the local 
spatial dependencies are minimal beyond this scale. This analysis, however, does not rule out long- 
range spatial dependencies, which are more difficult to assess due to the large dimensionality. Future 
work will focus on this. 


13 























Moon et al: Image patch analysis of active regions: Intrinsic dimension and correlation 


In the remainder of our analysis, we choose a 3 x 3 patch for the reasons mentioned in Section 2: 
to ensure that we capture the features of small sunspots and to limit the effects high dimension on 
the accuracy of the analysis. Given these concerns, we see that 3x3 patches capture most of the 
spatial correlation. This is evident from Figure 5 where the partial correlation between pixels on 
opposite comers of a 3 x 3 patch is near zero and other pixels that are similarly far away from each 
other have low partial correlation. Thus a 3 x 3 patch strikes a good balance between scale, extrinsic 
dimension, and capturing the spatial correlation. 

4.3. Canonical Correlation Analysis: Methodology 

To further investigate the correlation between the modalities, we use canonical correlation analysis 
(CCA) on the continuum patch x and the magnitude of the magnetogram patch y. CCA finds patterns 
and correlations between two multivariate data sets (Muller, 1982; Nimon et ah, 2010) and was 
used previously in the context of space weather e.g. for the combined analysis of solar wind and 
geomagnetic index data sets (Borovsky, 2014). 

In our application, CCA provides linear combinations of continuum patches x that are most corre¬ 
lated with linear combinations of magnetogram patches y. In other words, all correlations between 
the continuum and magnetogram patches are channeled through the canonical variables. Formally, 
CCA finds vectors a,- and b, for / = l,...,m^ such that the correlation p,- = corr(afx, bf|y|) is 
maximized and the pair of random variables w, = afx and v, = bf|y| are uncorrelated with all 
other pairs Uj and Vj, j 4 i. The variables w, and v, are called the fth pair of canonical variables 
while the vectors a, and b, are the canonical vectors. The solution a, is the fth eigenvector of the 
matrix Z“,j2x|y|2|^|jy|£|y|x which are taken from the covariance matrix. The vector b,- is found simi¬ 
larly (Hardle and Simar, 2007). 

4.4. Canonical Correlation Analysis: Results 

Here we focus on 3 x 3 patches and apply CCA to all 424 image pairs using the magnitude of the 
magnetogram patches. Figure 8 (left and center) shows histograms of the estimated values of pi. 
Within the sunspots, there are many groups with a near perfect correlation between the modalities 
and none of the groups have an estimated value below 0.41. The right plot in Figure 8 plots the 
estimated values of pi vs. the number of samples used within the sunspots. Based on this plot, there 
are many ARs with high correlation and few patch samples suggesting that the correlation may be 
spurious. However, all of the estimated values are statistically significant as defined by the threshold 
given by Hero and Rajaratnam (201 1) using a false alarm rate of 0.05 (shown as the magenta line 
in Figure 8). 

The histogram of pi within the magnetic fragments (Fig. 8, center) is quite different from the 
sunspot histogram (Fig. 8, left). pi is generally lower within the magnetic fragments than within 
the sunspots which is consistent with the results in Figure 5. All of the pi values are statistically 
significant. 

The distributions of p\ differ slightly when comparing simple sunspot groups (a and (3) with 
complex groups (fiy and I3y6). Figure 9 shows that complex groups generally have lower correla¬ 
tion between the modalities within the sunspots than the simpler groups. The estimated Hellinger 
distance (see Appendix A. 3) between the distributions using the divergence estimator in Moon and 
Hero III (2014a,b) is 0.22. Based on the central limit theorem of the estimator (Moon and Hero III, 
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Figures. Histograms of estimated pi using CCA for within the sunspots (left) and the magnetie 
fragments (eenter) using 3x3 patehes. Right: Seatter plot of pi values and the number of samples 
available for within the sunspots. All points are above the magenta line whieh gives the threshold 
for statistieal signifieanoe at a false alarm rate of 0.05 (Hero and Rajaratnam, 201 1). 


Sunspots Fragments 




Figure 9. pi histograms of eomplex ifiy and ySyd) and simple {a and yS) regions within the sunspots 
(left) and the magnetie fragments (right). The simple ARs are generally more eorrelated within 
the sunspots but less eorrelated within the magnetie fragments. The differenee between the sunspot 
distributions, as measured by the Hellinger distanee, is statistieally signifieant. 


2014b), this value is statistieally signifieant with a p-value of 1.6x10“^^. At least some of this differ- 
enee is likely due to the smaller size of the simpler groups (and thus smaller sample size). However, 
it is unlikely to fully explain the differenee given that there are many simple sunspot groups with 
high eorrelation and suffieient sample size. 

Within the magnetie fragments, there are many more simple regions than eomplex regions with 
Pi < 0.4 (see the histogram in Figure 9, right). This eould be related to the same phenomena that 
eauses the intrinsie dimension to be higher within the magnetie fragments of simple sunspot groups 
observed in Seetion 3.3. However, the estimated Hellinger distanee between these distributions is 
0.016. Using the same statistieal test, this estimate is not statistieally signifieant with a p-value of 
0.31. Thus the distributions are not statistieally different from eaeh other. 

To analyze the spatial patterns that produee the highest eorrelation between modalities, we apply 
CCA to the entire data set. Figure 10 plots p, for z = 1,..., 9 for within the sunspots and within the 
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Figure 10. (Left) Plot of the estimated p, using CCA on the entire data set for i = 1,... ,9. All 
values are statistieally signifieant (Hero and Rajaratnam, 2011). (Right) Canonieal patehes a, (top) 
and b, (bottom) for i = 1,..., 6 when using the entire data set from within the sunspots. The b,s are 
approximately equal to the negative of the a,s. 


magnetie fragments. The p, are all statistieally signifieant. Notiee that the p, are higher within the 
sunspots than the magnetie fragments whieh is eonsistent with the results in Figures 5 and 8. 

Figure 10 shows the eanonieal patehes a, and b, for i = 1,..., 6 when using all the data from 
within the sunspots. These are the spatial patterns within the two modalities that are most eorre- 
lated with eaeh other. The canonieal patches have a “saddle-like” appearance where the gradient is 
positive in some directions and negative in others. For example, in a 4 , the pixels to the left and right 
of the center are very negative but the pixels in the corners are all very positive. Note that these 
vectors correspond to centered values with respect to the mean patches. 

Comparing the a,s to the b,s shows that the b;S are approximately equal to the negative of the 
a,s. This makes sense as sunspots within the continuum images correspond to a decrease in value 
relative to the background while ARs within the magnitude of the magnetogram images correspond 
to an increase in value relative to the background. 

We also performed CCA separately on the data from the Mount Wilson classes. Figure 11 plots 
the Pi values for each class and the first canonical patches ai and bi. For p\ and p 2 , the values for 
each class decrease in order of complexity (a, /?, /3y, Py5). This is consistent with our comparison 
of the partial correlation matrices in Figure 7 where the partial correlation was generally higher (in 
magnitude) for the a groups than the others. This is also consistent with the intrinsic dimension 
analysis in Section 3 where the intrinsic dimension generally increases with complexity. This is be¬ 
cause if the correlation between and within modalities is higher, then fewer parameters are required 
to accurately describe the data which results in a lower intrinsic dimension. 

The canonical patches ai and bi have similar patterns across the different classes although the 
patches for the fiy class are flipped compared to the others. The magnitude of the values in the fiyS 
patches are also smaller than the those of the other patches. 

Overall, the results of this section suggest that the two modalities are correlated in both the 
sunspots and the magnetic fragments and are therefore not independent. The correlation is stronger 
within the sunspots compared to the magnetic fragments and stronger within the sunspots in simple 
ARs compared to complex ARs. However, the correlation is not perfect and so there may be an 
advantage to including both modalities in the classification of sunspots and flare prediction. 
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Figure 11. (Left) Plot of the estimated p, using CCA on data segregated by Mount Wilson classes 
for i = 1,..., 9 within the sunspots a groups start out with the highest correlation. (Right) Canonical 
patches Hi (top) and bi (bottom) for the Mount Wilson classes within the sunspots. Again, the biS 
are approximately equal to the negative of the ais as in Figure 10 but the patches differ slightly from 
class to class. 

5. Conclusion 

Existing AR categorical classification systems such as the Mount Wilson and McIntosh schemes 
describe geometrical arrangements of the magnetic field at the largest length scale. In this work, we 
have focused on the properties of the ARs at fine length scale. We showed that when we analyze the 
global statistics or attributes of these local properties, we find differences between the simple and 
complex ARs as defined using the large scale characteristics. So by this approach, we are analyzing 
both the large and fine scale properties of the images. Such results might be due to the multi-scale 
properties of the magnetic fields, as evidenced previously in Ireland et al. (2008). 

The local intrinsic dimension based on the k-NN approach combines both continuum and mag¬ 
netogram observations and provides some measure of local regularity for those images. Further 
differences between the Mount Wilson classes may be found by comparing the histograms or dis¬ 
tributions of local intrinsic dimension of each individual AR instead of only comparing the means 
or pooled estimates as we did in this paper. There are several options to perform such comparisons. 
Each histogram could be treated as a vector, or we could consider the underlying probability density 
function within the framework of functional analysis. Supervised (using Mount Wilson classes) or 
unsupervised classification could be performed. Another option would be to view the set of his¬ 
tograms belonging to a specific class as samples from a distribution of vectors (or a distribution of 
probability density functions). Different classes could then be compared using divergence measures 
such as the Hellinger distance described in Appendix A.3. 

This work also highlighted specific behaviors of the core of active regions (that corresponds to 
the sunspot masks in continuum) and magnetic fragments (the surrounding part of AR), as well 
as the difference of these two regions as a function of the Mount Wilson classification. We found 
that within the sunspots, the spatial and modal correlations are stronger than within the magnetic 
fragments. Additionally, simpler ARs were found to have higher correlation between the modalities 
within the sunspots than the complex ARs. 

This study paves the way for further analysis based on dictionary learning. Knowledge of the in¬ 
trinsic dimension allows us to choose the dictionary size. Moreover the results of Section 3 showed 
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that linear dictionary learning methods are sufficient. The spatial and modal correlation analysis in 
Section 4 justifies a choice of a patch size of 3 x 3 and confirms that both modalities (continuum 
and magnetogram) should be used in dictionary learning. 
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Appendix A: Method Details 

A.l. Intrinsic Dimension Estimation of Manifolds 

Consider data that are described in an extrinsic Euclidean space of d dimensions. However, suppose 
the data actually lie on a lower dimensional manifold Al. Thus the intrinsic dimension m of the data 
corresponds to the dimension of Ad. For example, data may be given to us in a 3 dimensional space 
but lie on the surface of a sphere. Thus the intrinsic dimension of the data would be 2. 

In some cases, data points from the same data set may lie on different manifolds. For example, 
part of the data with an extrinsic dimension of 3 could lie on the surface of a sphere (m = 2) while 
another part may lie on a circle (m = 1). We then say that data points from these different manifolds 
have a different local intrinsic dimension. The local intrinsic dimension gives some measure of the 
local complexity of the image. Additionally, the local intrinsic dimension is useful for dictionary 
learning because we can use it to determine whether different-sized dictionaries should be used for 
different regions, e.g. within the sunspots and outside of the sunspots. 

We now describe the k-NN estimator of intrinsic dimension in more detail. For a set of indepen¬ 
dently identically distributed random vectors Z„ = {zi,..., z„} with values in a compact subset of 
Bf, the k-nearest neighbors of z, in Z„ are the k points in Z„\{z,} closest to z, as measured by the 
Euclidean distance || • ||. The k-NN graph is then formed by assigning edges between a point in Z„ 
and its k-nearest neighbors. The intrinsic dimension is related to the total edge length of the k-NN 
graph and can be estimated based on this relationship. The k-NN graph is then formed by assigning 
edges between a point in Z„ and its k-nearest neighbors and has total edge length defined as 

n 

Ly^k{7.n) = X Zi II* 

;=1 zeNk,i 

where y > 0 is a power weighting constant and A4,r is the set of k nearest neighbors of z,. It has 
been shown that for large n, 

L,,AZ„) = + 6„, 

where a = (m - 'y)/m, c is a constant with respect to a{m), and is an error term that decreases 
to zero a.s. as n ^ oo (Costa and Hero III, 2006). A global intrinsic dimension estimate m is found 
based on this relationship using non-linear least squares over different values of n (Carter et ah, 
2010 ). 
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A local estimate of intrinsie dimension at a point z, ean be found by running the algorithm over 
a smaller neighborhood about z,. The varianee of this loeal estimate is then redueed by smoothing 
via majority voting in a neighborhood of z, (Carter et al., 2010). 

A. 2. Partial Correlation 

Let z be a random veetor with size m. Let E be the eovarianee matrix of z, that is = Co\{zi,Zj), 
and let K = be the inverse of the eovarianee matrix, also ealled the preeision matrix. 

The partial eorrelation between Zi and zj given all the other variables z\{z;, zj} measure the degree 
of eorrelation between these two variables after removing the effeet of the remaining ones. Let P,y 
denote the partial correlation between Zi and zj. It has been shown (Lauritzen, 1996) that P,y ean be 
related to the elements of the precision matrix K as follows: 


A.3. Estimating the Hellinger Distance 

Information divergences are a class of functionals that measure the differenee between two prob¬ 
ability distributions. The most popular divergenee measure is the Kullbaek-Leibler divergence 
(Kullbaek and Leibler, 1951). The Hellinger distance is another divergenee measure and is defined 
as 



where / and g are the two probability densities being eompared. The Hellinger distance is a metric 
which is not true of divergenees in general, we use the nonparametrie divergenee estimator derived 
in Moon and Hero III (2014a,b). In Moon and Hero III (2014a), it was shown that this estimator 
eonverges to the true divergenee with mean squared error eonvergenee rate 0{llT) where T is the 
number of samples from eaeh probability distribution. In Moon and Hero III (2014b), it was shown 
that the distribution of the normalized version of this estimator eonverges to the standard normal 
distribution. We can use this fact combined with a bootstrap estimate (Efron and Tibshirani, 1994) 
of the varianee of the estimator to test the hypothesis that the divergence is zero (and hence the 
distributions are equal). 
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